DOI Emma Rand. (2022). Data Analysis in R (BIO00017C) 2020: 2022 (v1.1). Zenodo. https://doi.org/10.5281/zenodo.6359475

Introduction

Session overview

In this workshop you will get practice in choosing between, performing, and presenting the results of one-way ANOVA and Kruskal-Wallis in R.

Learning Outcomes

By actively following the materials and carrying out the independent study before and after the contact hours the successful student will be able to:

  • Explain the rationale behind ANOVA (MLO 1 and 2)
  • Know the relationship between the sums of squares (\(SS\)), the mean squares (\(MS\)) and the \(F\) value
  • Select, appropriately,one-way ANOVA and Kruskal-Wallis including post-hoc tests (MLO 2)
  • Apply and interpret the tests in R (MLO 3 and 4)
  • Evaluate whether the assumptions of a test are met
  • Summarise and illustrate with appropriate R figures test results scientifically (MLO 3 and 4)

Philosophy

Workshops are not a test. It is expected that you often don’t know how to start, make a lot of mistakes and need help. Do not be put off and don’t let what you can not do interfere with what you can do. You will benefit from collaborating with others and/or discussing your results. It is expected that you are familiar with independent study content before the workshop. However, you need not remember or understand every detail as the workshop should build and consolidate your understanding. You may wish to refer to the independent study materials for reference.

Materials are indexed here: https://3mmarand.github.io/BIO00017C-Data-Analysis-in-R-2020/

Key

These four symbols are used at the beginning of each instruction so you know where to carry out the instruction.

W is something you need to do on your computer. It may be opening programs or documents or locating a file.

R is something you should do in RStudio. It will often be typing a command or using the menus but might also be creating folders, locating or moving files.

GC is something you should do in your browser on the internet. It may be searching for information, going to the VLE or downloading a file.

Q is question for you to think about an answer. You will usually want to record your answers in your script for future reference.

Artwork by Allison Horst

Getting started

W Start RStudio from the Start menu.

R Make an RStudio project for this workshop by clicking on the drop-down menu on top right where it says Project: (None) and choosing New Project and then New Directory, then New Project. Navigate to the β€œdata-analysis-in-r” folder. Name the RStudio Project β€˜workshop5’.

R On the Files tab click on New Folder. In the box that appears type β€œdata”. This will be the folder that we save data files to.

R Make a new script then save it with a name like analysis.R to carry out the rest of the work.

R Load the tidyverse:

library(tidyverse)

Exercises

Myoglobin in seal muscle

The myoglobin concentration of skeletal muscle of three species of seal in grams per kilogram of muscle was determined and the data are given in seal.csv. We want to know if there is a difference between species. Each row represents an individual seal. The first column gives the myoglobin concentration and the second column indicates species.

W Save a copy of the data file seal.csv

W Compare seal.csv and adipocytes.txt by opening the file. You can do this in RStudio by clicking on the files in the files window.

Q How do the data files differ? How might this influence your data import?

R Read in the data and check the structure. I used the name seal for the dataframe/tibble. What kind of variables do you have?

Exploring

R Do a quick plot of the data.You may need to refer to a previous workshop

Summarising the data

Do you remember this Top Tip?

Top Tip

The code required to summarise, and plot data for any two-sample t-test AND for any for any one-way ANOVA is exactly the same except for the names of the dataframe, variables and the axis labels and limits. Take some time to comment it. Consider making a script called ttest.R or similar with all the code and information you need to reuse it.

R If you followed that tip you’ll be able to open that script and whizz through summarising and plotting. Whether you did or did not, why not make a one-way-ANOVA.R script to help future you.

R Create a data frame called sealsummary that contains the means, standard deviations, sample sizes and standard errors for each species. You may want to look this up in your scripts from the previous workshops.

You should get the following numbers:

species mean std n se
Bladdernose Seal 42.31600 8.020634 30 1.464361
Harbour Seal 49.01033 8.252004 30 1.506603
Weddell Seal 44.66033 7.849816 30 1.433174

Applying, interpreting and reporting

We can now carry out a one-way ANOVA. When doing an ANOVA it is common to assign the result of the aov() procedure to a variable, then examine it using summary()

R Carry out an ANOVA and examine the results with:

mod <- aov(data = seal, myoglobin ~ species)
summary(mod)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## species      2    692   346.2   5.352 0.00643 **
## Residuals   87   5627    64.7                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Remember: the tilde (~) means test the values in myoglobin when grouped by the values in species. Or explain myoglobin with species

Q What do you conclude so far from the test? Write your conclusion in a form suitable for a report.

The ANOVA is significant but which means differ? We need a post-hoc test. A post-hoc (β€œafter this”) test is done after (and only after) a significant ANOVA test. The ANOVA tells you at least two of means differ, the post-hoc test tells you where the differences are. There are several possible post-hoc tests. A popular option is the Tukey Honest Significant Difference test.

R Carry out a Tukey HSD with:

TukeyHSD(mod)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = myoglobin ~ species, data = seal)
## 
## $species
##                                    diff       lwr       upr     p adj
## Harbour Seal-Bladdernose Seal  6.694333  1.742803 11.645863 0.0050207
## Weddell Seal-Bladdernose Seal  2.344333 -2.607197  7.295863 0.4989642
## Weddell Seal-Harbour Seal     -4.350000 -9.301530  0.601530 0.0968115

Each row is a comparison between means. The β€˜diff’ column is the difference between the means the β€˜p adj’ column indicates whether that difference is significant.

A plot can be used to visualise the result of the post hoc and make it easier to understand

R Use the generic plot() function to plot the post-hoc:

TukeyHSD(mod) %>% plot(cex.axis = 0.7)   # cex.axis just changes the size of the axis labels

You may want to zoom. This shows the confidence intervals on the differences between the means and the dashed vertical line is at zero so if the confidence interval crosses that line (i.e., the C.I. includes zero) then the means do not differ significantly.

Q What do you conclude from the test?

Check assumptions

We need to examine the residuals. Unlike with t.test we don’t have to calculate them - the object which is created by aov() contains a variable called $residuals.

Also conveniently, the R’s plot() function can used on the output objects of aov()

R Plot the model residuals against the fitted values like this:

plot(mod, which = 1)

The group means are the fitted (or predicted) values; each residual is the difference between the mean and the actual value.

Q What to you conclude?

R To examine normality of the model residuals we can do a histogram

hist(mod$residuals)

R Use the shapiro.test() on the model residuals

Q What to you conclude? -

Illustrating

We will produce a figure to go with a significant ANOVA in a report using ggplot2.

Look at the figure we did in the last practical for adiponectin concentration of control and nicotinic acid treated adipocytes.

Notice how the fundamental structure of the plot is the same as we require here, there is just a different number of groups.

We will again use both our seal and sealsummary dataframes.

R Create the plot:

ggplot() +
  geom_point(data = seal, aes(x = species, y = myoglobin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "gray50") +
  geom_errorbar(data = sealsummary, 
                aes(x = species, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = sealsummary, 
                aes(x = species, ymin = mean, ymax = mean),
                width = 0.2) +
  scale_y_continuous(name = expression(Myoglobin~concentration~g~Kg^{-1}),
                     limits = c(0, 80), 
                     expand = c(0, 0)) +
  scale_x_discrete(labels = c("Bladdernose", "Harbour", "Weddell"), 
                   name = "Seal Species") +
  theme_classic()

Notice the use of expression() to allow you to specify special characters. expression() takes strings or LaTeX formatting. Each string or piece of LaTeX is separated by a * or a ~. The * puts them together without a space, and ~ with a space. Kg^{-1}makes the -1 a superscript.

This figure is good …. but it would be nice to show the result of the post-hoc test by annotating the figure.

We can add annotation to a ggplot using annotate()

R Look up annotate() in the manual:

?annotate

The examples section might be useful to aid understanding. annotate() takes a geom as its first argument. This can be text segment rect etc. The other arguments are positioning aesthetics which say where the annotation should be placed. We need three segments: two short vertical lines and one long horizontal line. We also need one piece of text to give the \(p\) value.

R Add the annotations to the plot:

ggplot() +
  geom_point(data = seal, aes(x = species, y = myoglobin),
             position = position_jitter(width = 0.1, height = 0),
             colour = "gray50") +
  geom_errorbar(data = sealsummary, 
                aes(x = species, ymin = mean - se, ymax = mean + se),
                width = 0.3) +
  geom_errorbar(data = sealsummary, 
                aes(x = species, ymin = mean, ymax = mean),
                width = 0.2) +
  scale_y_continuous(name = expression(Myoglobin~concentration~g~Kg^{-1}),
                     limits = c(0, 80), 
                     expand = c(0, 0)) +
  scale_x_discrete(labels = c("Bladdernose", "Harbour", "Weddell"), 
                   name = "Seal Species") +
  # long horizontal. goes from bladdernose (x = 1) to harbour (xend = 2) 
  # the y and yend are the same
  annotate("segment", x = 1, xend = 2,   
           y = 72, yend = 72, 
           colour = "black") +
  # short horizontal, x and xend are the same at harbour (xend = 2)
  # y and yend are slightly apart
  annotate("segment", x = 2, xend = 2, 
           y = 72, yend = 70,
           colour = "black") +
  # short horizontal, x and xend are the same at bladdernose (x = 1)
  # y and yend are slightly apart
  annotate("segment", x = 1, xend = 1,
           y = 72, yend = 70,
           colour = "black") +
  # the text
  annotate("text", x = 1.5,  y = 75,
           label = expression(italic(p)~"= 0.005")) +
  theme_classic()

Top Tip

Use comments anywhere in the ggplot block to help you remember what each part does. I find this especially useful for labelled the annotations.

R Make a new folder called β€˜figures’ and write your figure to file. This will allow you to check your understanding of paths.

Leafminers on Birch

Larvae of the Ambermarked birch leafminer, Profenusa thomsoni, feed on the interior leaf tissues of Birch (Betula) species. They do not normally kill the tree but can weaken it making it susceptible to attack from other species. Researchers are interested in whether there is a difference in the rates at which white, grey and yellow birch are attacked. They introduce adult female P.thomsoni to a green house containing 30 young trees (ten of each type) and later count the egg laying events on each tree. The data are in leaf.txt.

Exploring

R Read in the data and check the structure. I used the name leaf for the dataframe/tibble. What kind of variables do you have?

R Do a quick plot of the data.You may need to refer to a previous workshop

Q Using your common sense, do these data look normally distributed?

Q Why is a Kruskal-Wallis appropriate in this case?

R Calculate the medians, means and sample sizes.

Applying, interpreting and reporting

R Carry out a Kruskal-Wallis:

kruskal.test(data = leaf, eggs ~ birch)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  eggs by birch
## Kruskal-Wallis chi-squared = 6.3393, df = 2, p-value = 0.04202

Q What do you conclude from the test?

The significant Kruskal-Wallis tells us at least two of the groups differ but where do the differences lie? A post-hoc multiple comparison test for a significant Kruskal-Wallis exists in the pgirmess package.

R Load the package using library()

R Run the post-hoc test with:

kruskalmc(data = leaf, eggs ~ birch)
## Multiple comparison test after Kruskal-Wallis 
## p.value: 0.05 
## Comparisons
##              obs.dif critical.dif difference
## Grey-White       5.1     9.425108      FALSE
## Grey-Yellow      4.8     9.425108      FALSE
## White-Yellow     9.9     9.425108       TRUE

The obs.diff column gives the mean rank for each group, critical.dif is how big the difference between the means ranks must be for significance at the 0.05 level (by default) and the final column tells you if the obs.diff is greater than the critical.dif.

Top Tip

You can use the probs argument to change the signifcance value the kruskalmc() test uses and by iteration, get a \(p\) value, for example: kruskalmc(data = leaf, eggs ~ birch, probs = 0.04) kruskalmc(data = leaf, eggs ~ birch, probs = 0.03)

Q What do you conclude from the test?

Q Write up the result is a form suitable for a report.

Illustrating

R A box plot is an appropriate choice for illustrating a Kruskal-Wallis. Can you produce a figure like this?

Top Tip

Look at your script for last practical for the plot we did for the grouse data.

πŸŽ‚ Well Done! πŸŽ‚

Independent study following the workshop

Decide which test you should use to analyse the each following data sets. In each case give the reasons for your choice of test and state the null hypothesis. Write your conclusions in a form suitable for including in a report. Can you make figures?

1. Effects of fitness and heat acclimatisation

Sports scientists were investigating the effects of fitness and heat acclimatisation on the sodium content of sweat. They measured the sodium content of the sweat (\(\mu mol\mspace{3mu}l^{-1}\)) of three groups of individuals: unfit and unacclimatised (UU); fit and unacclimatised(FU); and fit and acclimatised (FA). The are in sweat.txt. Is there a difference between the groups in the sodium content of their sweat?

2. Insecticide effectiveness

The data are given in biomass.txt are taken from an experiment in which the insect pest biomass (g) was measured on plots sprayed with water (control) or one of five different insecticides. Do the insecticides vary in their effectiveness? What advice would you give to a person: - currently using insecticide E? - trying to choose between A and D? - trying to choose between C and B?

The Code files

These contain all the code needed in the workshop even where it is not visible on the webpage.

Rmd file The Rmd file is the file I use to compile the practical. Rmd stands for R markdown. It allows R code and ordinary text to be interweaved to produce well-formatted reports including webpages. If you right-click on the link and choose Save-As, you will be able to open the Rmd file in RStudio. Alternatively, View in Browser.

Plain script file This is plain script (.R) version of the practical generated from the Rmd. Again, you can save this and open it RStudio. Alternatively, View in Browser.

Pages made with rmarkdown (Allaire Xie, et al., 2019a; Xie Allaire, et al., 2018a), kableExtra (Zhu, 2019a), RefManager (McLean, 2014)

References

Allaire, J., Y. Xie, et al. (2019a). rmarkdown: Dynamic Documents for R. R package version 1.16. URL: https://github.com/rstudio/rmarkdown.

McLean, M. W. (2014). Straightforward Bibliography Management in R Using the RefManager Package. arXiv: 1403.2036 [cs.DL]. URL: https://arxiv.org/abs/1403.2036.

Xie, Y., J. Allaire, et al. (2018a). R Markdown: The Definitive Guide. ISBN 9781138359338. Boca Raton, Florida: Chapman and Hall/CRC. URL: https://bookdown.org/yihui/rmarkdown.

Zhu, H. (2019a). kableExtra: Construct Complex Table with β€˜kable’ and Pipe Syntax. R package version 1.1.0. URL: https://CRAN.R-project.org/package=kableExtra.

Please cite as:

Emma Rand. (2022). Data Analysis in R (BIO00017C) 2020: 2022 (v1.1). Zenodo. https://doi.org/10.5281/zenodo.6359475